• Readme
  • Read files
  • Processing the pseudo counts tables
  • Session info

Readme

Hi Shinya,

I uploaded the first part of count matrices of ROSMAP single-nucleus RNAseq to Google Drive. I have sent invitation to the Google folder separately.

Because my storage space is limited, I will transfer data twice. The first part contains 240 folders and 1 annotation file. Please let me know when you finish downloading them. I will delete them and upload the second part.

Each folder is named for ROSMAP projid and contains a count matrix of nuclei from the individual.
matrix.mtx.gz: UMI count matrix in the MatrixMarket format.
barcodes.tsv.gz: droplet barcodes, batch, WGS ID, and projid of nuclei.
features.tsv.gz: Gene ID and gene symbols.

You can read them using the Seurat package of R.
> library(Seurat)
> dat <- ReadMtx(mtx="matrix.mtx.gz", cells="barcodes.tsv.gz", features="features.tsv.gz")
> seu <- CreateSeuratObject(dat)

annotation.2022-01-24.tsv.gz is the latest version of our cell type annotation. This one will be mostly stable, but our collaborator may still add minor tweaks in a next few months. There are several levels of cell type annotations, but we are using cell.type as the top-level cell type, and state as their subtype classifications.

Brief description of data processing:
Nuclei were isolated from 479 DLPFC tissues of ROSMAP. Tissues were processed as 60 batches, and each batch consisted from 8 donors. In each batch, nuclei suspension of 8 donors were mixed together, and single-nucleus RNAseq library was prepared using 10x Genomics 3 Gene Expression kit (v3 chemistry). The libraries were sequenced, and read mapped and UMI counting were performed using CellRanger v6.0.0 with GENCODE v32 and GRCh38.p13. Original donors of droplets in each batch were inferred by comparing SNPs in RNA reads with ROSMAP WGS VCF using genetic demultiplexing software demuxlet. As a quality control, genotype concordance of RNA and WGS, sex check, duplicated donors, WGS QC, and sequencing depth were assessed, and 424 donors passed the QC. To annotate cell types based on single nucleus expression, nuclei were classified into 8 major cell types, and each major cell type were analyzed separately. Doublets were removed using DoubletFinder, and cells were clustered using Seurat.

Best,
Masashi
data_dir = "/pastel/resources/20220203_snRNAseq_AMPAD/"

Read files

# Read phenotype
pheno = readRDS("/pastel/resources/phenotypes/basic_Apr2019.rds")

# Read annotation
annotation = read_tsv(paste0(data_dir,"annotation.2022-01-24.tsv.gz"), show_col_types = FALSE)
# length(unique(annotation$projid)) # 438

# List samples included
files_list = list.dirs(path = data_dir, full.names = F)
files_list = files_list[!files_list==""] 

metadata_df = data.frame(projid = files_list)
# length(unique(metadata_df$projid)) # 424

pheno.filt = pheno$data %>% filter(projid %in% unique(metadata_df$projid))
annotation.filt = annotation %>% filter(projid %in% unique(metadata_df$projid))

cell_types_df = as.data.frame(table(annotation$cell.type)) %>% arrange(-Freq) %>% mutate(Fraction = scales::percent(Freq/sum(Freq))) %>% rename(CellType = Var1)
cell_types_df
ABCDEFGHIJ0123456789
CellType
<fct>
Freq
<int>
Fraction
<chr>
Excitatory Neurons67133540.37335%
Oligodendrocytes33314720.03510%
Inhibitory Neurons25792915.51157%
Astrocyte22892513.76730%
Microglia837295.03537%
OPCs656353.94722%
Endothelial104840.63050%
Fibroblast37050.22281%
Pericytes31320.18836%
Macrophages21310.12816%

We will read each sample’ counts table, and create pseudo-bulk tables for each cell type (at least >1%).

cell_types_df_to_consider = cell_types_df %>% filter(Freq/sum(Freq) > 0.01)

cell_type_ids = list()
for(cell in cell_types_df_to_consider$CellType){
  cell_type_ids[[cell]] <- annotation[annotation$cell.type == cell, ]$barcode
}

read_counts <- function(sample){
  #sample = metadata_df$projid[1]
  mtx_file = paste0(data_dir,sample,"/matrix.mtx.gz")
  cells_file = paste0(data_dir,sample,"/barcodes.tsv.gz")
  features_file = paste0(data_dir,sample,"/features.tsv.gz")
  
  dat <- ReadMtx(mtx=mtx_file, cells=cells_file, features=features_file, feature.column = 1)
  
  pseudo_counts <- list()
  for(cell in names(cell_type_ids)){
    #cell = names(cell_type_ids)[1]
    barcodes = annotation[annotation$cell.type == cell,]$barcode
    cell_selected = which(colnames(dat)%in%barcodes)
    pseudo_counts[[cell]] <- rowSums(dat[,cell_selected])
  }
  pseudo_counts$sample = sample
  return(pseudo_counts)
}

pseudo_counts_list <- furrr::future_map(metadata_df$projid, read_counts, .options = furrr_options(seed = 123))

gene_names = names(pseudo_counts_list[[1]]$`Excitatory Neurons`)
sample_names = unique(metadata_df$projid)

ext_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))
inh_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))
oli_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))
ast_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))
mic_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))
opc_pseudo_counts = as.data.frame(matrix(data = NA, nrow = length(gene_names), ncol = length(sample_names), dimnames = list(gene_names,sample_names)))

for(sample_counts in pseudo_counts_list){
  #sample_counts = pseudo_counts_list[[1]]
  ext_pseudo_counts[names(sample_counts$`Excitatory Neurons`),sample_counts$sample] <- sample_counts$`Excitatory Neurons`
  inh_pseudo_counts[names(sample_counts$`Inhibitory Neurons`),sample_counts$sample] <- sample_counts$`Inhibitory Neurons`
  oli_pseudo_counts[names(sample_counts$Oligodendrocytes),sample_counts$sample] <- sample_counts$Oligodendrocytes
  ast_pseudo_counts[names(sample_counts$Astrocyte),sample_counts$sample] <- sample_counts$Astrocyte
  mic_pseudo_counts[names(sample_counts$Microglia),sample_counts$sample] <- sample_counts$Microglia
  opc_pseudo_counts[names(sample_counts$OPCs),sample_counts$sample] <- sample_counts$OPCs
}
# Just checking if total counts are the same
for(sample_counts in pseudo_counts_list){
 if (!(sum(ast_pseudo_counts[,sample_counts$sample]) == sum(sample_counts$Astrocyte))){
   print(sample_counts$sample)
 }
}

save(ext_pseudo_counts,inh_pseudo_counts,oli_pseudo_counts,ast_pseudo_counts,mic_pseudo_counts,opc_pseudo_counts,
    file = paste0(data_dir,"pseudo_counts.RData"))

Processing the pseudo counts tables

library(tidyverse)
library(rtracklayer)
library(GenomicFeatures)
library(BSgenome.Hsapiens.UCSC.hg38)
library(DESeq2)

# process gene and transcript annotation #####
# Using: GENCODE v32 and GRCh38.p13
file = "/pastel/resources/annotation/gencode.v32.primary_assembly.annotation.gtf.gz"

gtf = import(file)
gtf = as.data.frame(gtf)
gtf %>% filter(type%in%c("transcript")) -> gtf.transcripts
gtf %>% filter(type%in%c("gene")) -> gtf.genes

txdb <- makeTxDbFromGFF(file)
ebg <- exonsBy(txdb, by="gene") 
exons <- reduce(ebg)
exons <- exons[ gtf.genes$gene_id ]

seqnames(Hsapiens) <- gsub("(.*?)_(.*)","\\2",gsub("v","\\.",gsub("(.*?)_(.*?)_(.*)","\\2",seqnames(Hsapiens))))
dna <- extractTranscriptSeqs(Hsapiens, exons)

gtf.genes$gc_exon <- as.numeric(letterFrequency(dna, "GC", as.prob=TRUE))
gtf.genes$len_exon <- sum(width(ranges(exons)))

# save(gtf.genes, file = "/pastel/resources/annotation/gencode.v32.primary_assembly.ann.RData")

From Bryois et al medRxiv

Pseudo-bulk gene expression matrices were then generated by summing all counts for each gene in each patient in each cell type and normalized by scaling the total counts per patient for each cell type to 1 million.

Here:

  1. Pseudo-bulk created by summing counts from single cell data
  2. Genes are filtered by each cell type (kept genes with CPM > 1 in 80% of samples)
  3. Perform CQN accounting by exon length and exon gc
  4. Return matrices by Counts (raw), CPM (raw), TPM, CQN, Adj.Counts, TMM-Voom, Quantile-Voom
library(edgeR)
library(cqn)

load("/pastel/resources/annotation/gencode.v32.primary_assembly.ann.RData")
load(paste0(data_dir,"pseudo_counts.RData"))

gtf.genes = gtf.genes %>% mutate(ensembl_id_nover = gsub("(.*)\\.(.*)","\\1",gene_id))

## Batch info
sample_batch = annotation %>% filter(projid %in% metadata_df$projid) %>% dplyr::select(projid, batch) %>% distinct() %>%
  mutate(batch_0 = gsub("(.*)-(.*)","\\1",batch), batch_1 = gsub("(.*?)-(.*)","\\1",batch), batch_2 = gsub("(.*?)-(.*?)-(.*)","\\2",batch)) %>% distinct()

pseudo_counts = list("ext" = ext_pseudo_counts,
                     "inh" = inh_pseudo_counts,
                     "oli" = oli_pseudo_counts,
                     "ast" = ast_pseudo_counts,
                     "mic" = mic_pseudo_counts,
                     "opc" = opc_pseudo_counts)

celltype_exp = list()
for (cell_type in names(pseudo_counts)){
  # cell_type = names(pseudo_counts)[5] # debug
  cell_data = list()
  
  counts = pseudo_counts[[cell_type]]
  
  # Filter genes by CPM
  counts_cpm = as.data.frame(cpm(counts))
  keep.exp <- rowSums(counts_cpm > 1) >= (0.8 * ncol(counts_cpm) )
  counts_cpm_filt = counts_cpm[keep.exp,]
  counts_filt = counts[keep.exp,]

  gtf.genes_filt = gtf.genes[match(rownames(counts_filt), gtf.genes$ensembl_id_nover),]
  # all(rownames(counts_filt)==gtf.genes_filt$ensembl_id_nover) # TRUE?
  
  # TPM
  tpm_x = counts_filt/gtf.genes_filt$len_exon
  tpm = t(t(tpm_x)*1e6/colSums(tpm_x))
  
  sizeFactors = colSums(counts_filt, na.rm = T)
  cqn.subset <- cqn(counts_filt, lengths = gtf.genes_filt$len_exon, x = gtf.genes_filt$gc_exon, sizeFactors = sizeFactors, verbose = TRUE)
  RPKM.cqn.log2 <- cqn.subset$y + cqn.subset$offset

  # convert to count
  adjusted.count <- 2^RPKM.cqn.log2 * rep(colSums(counts_filt), nrow(RPKM.cqn.log2)) %>%
    matrix(., ncol=ncol(counts_filt), nrow=nrow(counts_filt), byrow = T) 
  adjusted.count = adjusted.count/10^6 - 1
  adjusted.count[adjusted.count<0]=0
  
  # quantile-voom
  counts_voom = voom(counts = adjusted.count, normalize.method="quantile")
  quantile_voom = as.data.frame(counts_voom$E) 
  
  # TMM-voom
  norm <- calcNormFactors(counts_filt, method = "TMM") 
  dge = DGEList(counts_filt, norm.factors = norm)
  counts_voom <- voom(dge)
  tmm_voom <- as.data.frame(counts_voom$E) 
  
  cell_data$counts <- counts
  cell_data$counts_cpm <- counts_cpm_filt
  cell_data$counts_tpm <- tpm
  cell_data$cqn <- cqn.subset
  cell_data$RPKM.cqn.log2 <- RPKM.cqn.log2
  cell_data$adjusted.count <- adjusted.count
  cell_data$quantile_voom <- quantile_voom
  cell_data$tmm_voom <- tmm_voom
  
  celltype_exp[[cell_type]] <- cell_data
}

save(celltype_exp, pheno.filt, annotation.filt, file = paste0(data_dir,"celltype_exp.RData"))

Summary based on cqn adj. counts

load(paste0(data_dir,"celltype_exp.RData"))
df_summary = data.frame()
for (cell in names(celltype_exp)){
  #cell = names(celltype_exp)[1]
  dat = celltype_exp[[cell]]
  dat2 = as.data.frame(dat$adjusted.count)
  dat_df = data.frame(cell_type = cell,
                      ngenes = nrow(dat2),
                      nsamples = ncol(dat2),
                      avg_libsize = mean(colSums(dat2)),
                      avg_counts = mean(rowMeans(dat2)))
  df_summary = rbind(df_summary, dat_df)
}
df_summary
ABCDEFGHIJ0123456789
cell_type
<chr>
ngenes
<int>
nsamples
<int>
avg_libsize
<dbl>
avg_counts
<dbl>
ext1683642440891078.02428.78819
inh1646742413641449.2828.41132
oli158594242234842.6140.91952
ast163834244482531.5273.60871
mic14395424621604.443.18197
opc153644241045919.568.07599

Comparison between distributions

library(ggpubr)
library(viridis)

cell_color_map = list(ext = ggsci::pal_nejm("default")(6)[1],
                      inh = ggsci::pal_nejm("default")(6)[2],
                      oli = ggsci::pal_nejm("default")(6)[3],
                      ast = ggsci::pal_nejm("default")(6)[4],
                      mic = ggsci::pal_nejm("default")(6)[5],
                      opc = ggsci::pal_nejm("default")(6)[6])
cell_labels = list(ext = "Excitatory Neurons",
                   inh = "Inhibitory Neurons",
                   oli = "Oligodendrocytes",
                   ast = "Astrocytes",
                   mic = "Microglia",
                   opc = "OPCs")
cell_labels_n = list(ext = paste0("Excitatory Neurons (genes = ", df_summary$ngenes[df_summary$cell_type=="ext"], ")"),
                   inh = paste0("Inhibitory Neurons (genes = ", df_summary$ngenes[df_summary$cell_type=="inh"], ")"),
                   oli = paste0("Oligodendrocytes (genes = ", df_summary$ngenes[df_summary$cell_type=="oli"], ")"),
                   ast = paste0("Astrocytes (genes = ", df_summary$ngenes[df_summary$cell_type=="ast"], ")"),
                   mic = paste0("Microglia (genes = ", df_summary$ngenes[df_summary$cell_type=="mic"], ")"),
                   opc = paste0("OPCs (genes = ", df_summary$ngenes[df_summary$cell_type=="opc"], ")"))

plot_exp_density_per_sample <- function(exp_matrix, title = "", subtitle = "", x_label = "value", color = "#4DBBD5FF"){
  #exp_matrix <- cell_type$RPKM.cqn.log2
  exp_matrix <- as.matrix(exp_matrix)
  nsamples <- ncol(exp_matrix)
  colfunc <- grDevices::colorRampPalette(c(color, alpha(color, 0.5)))
  col = alpha(colfunc(nsamples), alpha = 0.1)
  per_sample = reshape2::melt(exp_matrix)
  density_p <- ggplot(per_sample, aes(x=value, color=as.factor(Var2))) + 
    geom_density(show.legend = FALSE) + 
    scale_fill_viridis_d() +
    scale_color_manual(values=col) +
    labs(x = x_label, y = "Density", title = title, subtitle = subtitle) +
    theme_classic()
  return(density_p)
}

dist_plots = list()
for(cell_type_i in names(celltype_exp)){
  #cell_type_i = names(celltype_exp)[5] # debug
  cell_type <- celltype_exp[[cell_type_i]]
  a1 = plot_exp_density_per_sample(log2(cell_type$counts_cpm+1), subtitle = cell_labels_n[[cell_type_i]], x_label = expression(log[2]("CPM + 1")), color = cell_color_map[[cell_type_i]])
  a2 = plot_exp_density_per_sample(log2(cell_type$counts_tpm+1), subtitle = cell_labels_n[[cell_type_i]], x_label = expression(log[2](TPM + 1)), color = cell_color_map[[cell_type_i]])
  a3 = plot_exp_density_per_sample(cell_type$RPKM.cqn.log2, subtitle = cell_labels_n[[cell_type_i]], x_label = expression(log[2]("RPKM CQN")), color = cell_color_map[[cell_type_i]])
  a4 = plot_exp_density_per_sample(cell_type$tmm_voom, subtitle = cell_labels_n[[cell_type_i]], x_label = expression("tmm-voom"), color = cell_color_map[[cell_type_i]])
  dist_plots[[cell_type_i]] <- ggarrange(a1,a2,a3,a4, nrow = 1, ncol = 4)
}

ggarrange(plotlist = dist_plots, nrow = 6, ncol = 1, labels = "auto")

Session info

sessionInfo()

R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Stream 8

Matrix products: default BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.15.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] viridis_0.6.2 viridisLite_0.4.0 ggpubr_0.4.0 furrr_0.2.3
[5] future_1.24.0 kableExtra_1.3.4 forcats_0.5.1 stringr_1.4.0
[9] dplyr_1.0.8 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
[13] tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1 SeuratObject_4.0.4 [17] Seurat_4.1.0

loaded via a namespace (and not attached): [1] readxl_1.4.0 backports_1.4.1 systemfonts_1.0.4
[4] plyr_1.8.7 igraph_1.3.0 lazyeval_0.2.2
[7] splines_4.1.2 listenv_0.8.0 scattermore_0.8
[10] digest_0.6.29 htmltools_0.5.2 fansi_1.0.3
[13] magrittr_2.0.3 tensor_1.5 cluster_2.1.2
[16] ROCR_1.0-11 tzdb_0.3.0 globals_0.14.0
[19] modelr_0.1.8 matrixStats_0.62.0 vroom_1.5.7
[22] svglite_2.1.0 spatstat.sparse_2.1-0 colorspace_2.0-3
[25] rvest_1.0.2 ggrepel_0.9.1 haven_2.4.3
[28] xfun_0.30 crayon_1.5.1 jsonlite_1.8.0
[31] spatstat.data_2.1-4 survival_3.2-13 zoo_1.8-9
[34] glue_1.6.2 polyclip_1.10-0 gtable_0.3.0
[37] webshot_0.5.2 leiden_0.3.9 car_3.0-12
[40] future.apply_1.8.1 abind_1.4-5 scales_1.2.0
[43] DBI_1.1.2 rstatix_0.7.0 spatstat.random_2.2-0 [46] miniUI_0.1.1.1 Rcpp_1.0.8.3 xtable_1.8-4
[49] reticulate_1.24 spatstat.core_2.4-2 bit_4.0.4
[52] htmlwidgets_1.5.4 httr_1.4.2 RColorBrewer_1.1-3
[55] ellipsis_0.3.2 ica_1.0-2 farver_2.1.0
[58] pkgconfig_2.0.3 sass_0.4.1 uwot_0.1.11
[61] dbplyr_2.1.1 deldir_1.0-6 utf8_1.2.2
[64] labeling_0.4.2 tidyselect_1.1.2 rlang_1.0.2
[67] reshape2_1.4.4 later_1.3.0 munsell_0.5.0
[70] cellranger_1.1.0 tools_4.1.2 cli_3.3.0
[73] generics_0.1.2 broom_0.7.12 ggridges_0.5.3
[76] evaluate_0.15 fastmap_1.1.0 yaml_2.3.5
[79] goftest_1.2-3 bit64_4.0.5 knitr_1.38
[82] fs_1.5.2 fitdistrplus_1.1-8 RANN_2.6.1
[85] pbapply_1.5-0 nlme_3.1-153 mime_0.12
[88] xml2_1.3.3 compiler_4.1.2 rstudioapi_0.13
[91] plotly_4.10.0 png_0.1-7 ggsignif_0.6.3
[94] spatstat.utils_2.3-0 reprex_2.0.1 bslib_0.3.1
[97] stringi_1.7.6 highr_0.9 lattice_0.20-45
[100] Matrix_1.3-4 ggsci_2.9 vctrs_0.4.1
[103] pillar_1.7.0 lifecycle_1.0.1 spatstat.geom_2.4-0
[106] lmtest_0.9-40 jquerylib_0.1.4 RcppAnnoy_0.0.19
[109] data.table_1.14.2 cowplot_1.1.1 irlba_2.3.5
[112] httpuv_1.6.5 patchwork_1.1.1 R6_2.5.1
[115] promises_1.2.0.1 KernSmooth_2.23-20 gridExtra_2.3
[118] parallelly_1.30.0 codetools_0.2-18 MASS_7.3-54
[121] assertthat_0.2.1 withr_2.5.0 sctransform_0.3.3
[124] mgcv_1.8-38 parallel_4.1.2 hms_1.1.1
[127] grid_4.1.2 rpart_4.1-15 rmarkdown_2.13
[130] carData_3.0-5 Rtsne_0.15 shiny_1.7.1
[133] lubridate_1.8.0